1. Front matter

Due Friday April 8, 5:00PM Central.

Initiate your repo: https://classroom.github.com/a/-PgTc6cw Submit by pushing your code to your repo on Github Classroom. This submission is my work alone and complies with the 30535 integrity policy. Add your initials to indicate your agreement: KEZG Add names of anyone you discussed this problem set with: NO Late coins used this pset: 0. Late coins left after submission: 5. We use (*) to indicate a problem that we think might be time consuming. Notes on submission (10 pts)

2. Git concepts

1. Git is software for distributed version control. List 4 benefits of distributed version control.

  • Backup copies in the cloud, which collects the editions of the document.
  • Being able to work offline. Git does not require an Internet connection. This helps the team to work at any time and place regardless of their connection. The run history is left on the hard drive.
  • Quick merge and flexible branching. Since there is no remote server, the code is merged faster. Also, the team can access different forking strategies, which is not possible with a centralized system.
  • Fast feedback and fewer merge conflicts.Having access to the history of changes helps to go back to previous versions and review errors. Also, since the repository is on a local workstation, you can quickly request code reviews. Finally, merge conflicts are less likely.

Source: https://about.gitlab.com/topics/version-control/benefits-distributed-version-control-system/

2. For the next questions, we will reference “Learn git concepts, not commands”. Read sections from “Overview” through “Branching”. It is written with git commandline in mind, but github desktop has all these features as well. Focus on the concepts (ie understand the pictures). Save some changes to your homework and make sure it’s being tracked by git with github deskop.

  1. What is the remote repository for this homework? Be as specific as possible.

ii. How do you add a file to staging in github desktop? (This is subtle, because it happens automatically).

  1. On GitHub.com, we can navigate to the main page of the repository. Then, above the list of files, we can click “Add file” and then click “Upload files”.

  2. We can add a file to the folder that contains our repository on our computer (when the repository is cloned on our computer), and then commit the file and push it to GitHub.

iii. How do you commit an issue to the local repository? (This is not subtle).

We have to go to the Github application on our computer, and then click on the “commit to main” option, previously we have the option to give the commit a summary and a description.

iv. How does github desktop decide what part of your code to show in the main part of the window?

Github desktop shows in the main part of the screen, the changes made to the file. The lines of code appear in red or green to indicate deletes or additions of code.

v. What branch are you on right now? Why?

I am now on the main branch, this branch is assigned by default at the start.

vi. If you were to click on “current branch”, type a name and click the “New Branch” button, you would create a new branch.

a. What would happen to the files in your working directory?

Now in “current branch” there are two branches, and therefore a file can be edited separately in this second branch without this implying a change in the other branch (a change in “New Branch” would not be reflected in “Main”).

b. What would happen in the remote repo?

Two branches and two versions of the code appear in the remote GitHub repository.

c. Why would you want to work on a different branch?

When it is important to preserve the original code into a chain of changes, for reference purposes. It may be that the main branch is a model version of what you are looking for, and will serve to guide all future codes and changes

If you created a new branch in the previous step, nice! Experimentation develops your skills and understanding.Now, make sure you are on master as you continue your homework.

3. Practicing with dplyr verbs

3.1 Debugging mindset (5 pts)

1. Why doesn’t this code work?

my_variable <- 10 my_variab1e

This formula is not working because the name of the variable created is “my_variable” and not “my_variab1e” (a 1 instead of an “l”), it is necessary to call the variable by the same name assigned to have the result ‘10’.

2. Tweak each of the following R commands so that they run correctly:

  1. library(tidilyverse)

  2. ggplot(dota = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

  3. fliter(mpg, cyl = 8)

  4. filter(diamond, carat > 3)

Modifying:

    1. The package name is ‘tidyverse’ and not ‘tidilyverse’
library(tidyverse)
    1. There is a typo, saying ‘dota’ instead of ‘data’
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

    1. There are two error here, (1) the function is filter() and not fliter(). Also, we need double equal sign to establish equality
filter(mpg, cyl == 8)
    1. The correct data frame name is ‘diamonds’ and not ‘diamond’
filter(diamonds, carat > 3)

3. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?

The RStudio keyboard shortcuts appears on screen. We can get the same results going to menu at the top: Tools > Keyboard Shortcuts Help.

3.2 Transformations: Filter rows with filter() (10 pts)

1. Using the flights data from the nycflights13 package. Find all flights that:

library(nycflights13)
view(flights)
?flights

i. Had an arrival delay of three or more hours

filter(flights,arr_delay/60>=3)

ii. Flew to Houston (IAH or HOU)

filter(flights,dest == "IAH" | dest == "HOU")

iii. Were operated by United, American, or Southwest

  • Looking up airline names from their carrier codes
view(airlines)
filter(flights,carrier == "UA" | carrier == "AA" | carrier == "WN")

iv. Departed in spring (March, April, May and June)

filter(flights,month == 3 | month == 4 | month == 5 | month == 6)

v. Arrived more than two hours late, but didn’t leave late

filter(flights,arr_delay/60>2 & dep_delay <= 0)

vi. Were delayed by at least an hour, but made up over 30 minutes in flight

filter(flights,dep_delay/60>=1 & dep_delay - 30 > arr_delay)

vii. Departed between midnight and 5am (inclusive)

filter(flights, dep_time>0 & dep_time<=500)

2. We reproduced a graph from the mpg data. There is one green triangle point that has very low gas mileage. What car is it?

view(mpg)
filter(mpg, drv == "f" & displ<4 & cty<15)

1. Common bugs: You want to see missing values in a dataframe. You run the following code and get no results. Why is wrong?

filter(flights, arr_time == NA)

It’s wrong because, to check for missing values we have to use the code “is.na” and not filter.

2. How many flights have a missing dep_time? Among these flights, what other variables are missing? What might these rows represent?

  • These rows can represent flights that never departed, or in other words, that were scheduled but later cancelled.

  • There are 8,255 flights that have a missing departure time.

sum(is.na(flights$dep_time))
## [1] 8255
  • For the flights that have missing departure time, 5 other variables have missing values:dep_delay (8255), arr_time (8255), arr_delay (8255), tailnum (2512), air_time (8255)
filter(flights, is.na(dep_time))
  • Among missing dep_time, variables by total missing values:
x <-filter(flights, is.na(dep_time))
sapply(x, function(x) sum(is.na(x)))
##           year          month            day       dep_time sched_dep_time 
##              0              0              0           8255              0 
##      dep_delay       arr_time sched_arr_time      arr_delay        carrier 
##           8255           8255              0           8255              0 
##         flight        tailnum         origin           dest       air_time 
##              0           2512              0              0           8255 
##       distance           hour         minute      time_hour 
##              0              0              0              0

Source: https://discuss.analyticsvidhya.com/t/how-to-count-the-missing-value-in-r/2949/13

3. What does NA | TRUE evaluate to? Why?

NA | TRUE
## [1] TRUE

Evaluate for true, giving that the logical connector is “|” “or” and because at least one side it is TRUE

4. What does FALSE & NA evaluate to? Why?

FALSE & NA
## [1] FALSE

Evaluate for false, because NA represents something that is ambiguous or unknown (it could be true or false), and given that part of it is already FALSE, the result will be FALSE

3.3 Transformations: Select columns with select() (10 pts)

1. What happens if you include the name of a variable multiple times in a select() call?

select(flights, day, month, day, year, day, year, dep_time)

If someone includes the name of variables multiple times, select just show it once, in the position where it first was included.

2. Typically, R is case-sensitive, but select helpers ignore case by default. Change the default to return an empty tibble.

select(flights, contains(“TIME”))

select(flights, contains("TIME", ignore.case=FALSE))

Source: https://stackoverflow.com/questions/8361589/turning-off-case-sensitivity-in-r

3. Brainstorm as many distinct ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

Four ways:

select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, starts_with("dep_"), starts_with("arr_"))
select(flights, contains("_time") & -(contains("sched")) & -(contains("air")) | contains("delay"))
var <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights,var)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(var)` instead of `var` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

3.4 Transformations: Sort columns with arrange() (5 pts)

1. Sort flights to find the most delayed flights by arrival time. End your code with %>% head(1) to only print the most delayed row.

arrange(flights, desc(arr_delay)) %>% head(1)

2. Find the top 5 flights that left earliest relative to their scheduled departure. Only show the tail number, date and number of minutes early they departed at.

arrange(flights, dep_delay) %>% head(5) %>% select(tailnum,year, month, day, dep_delay)

3. Using the flight data, how could you use arrange() to sort all missing values in a certain variable first, before the rest of the data? (Hint: use is.na()).

arrange(flights, desc(is.na(dep_time)))

3.5 Transformations: Add new variables with mutate() (10 pts)

**1.Currently dep_time and arr_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to number of minutes since midnight in new vars called _time_min, where here is variable and could be dep or arr**

flights <- mutate(flights, dep_time_min = (flights$dep_time %/%100)*60 + flights$dep_time %%100)

flights <- mutate(flights, arr_time_min = (flights$arr_time %/%100)*60 + flights$arr_time %%100)
  • Showing the work done:
flights %>%
  select(tailnum, dep_time, arr_time, dep_time_min, arr_time_min)

2. Make a plot to illustrate the distribution of the difference between air_time with arr_time_min - dep_time_min. What do you notice about the distribution?

flights <- mutate(flights, diff= flights$arr_time_min - flights$dep_time_min)
view(flights)
ggplot(data = flights) + geom_point(mapping = aes(x=air_time, y= diff))

We would expect the value of air_time and the difference of arr_time_min - dep_time_min to be the same; however, from the distribution we can notice that the values are very far from each other in addition to being scattered. Therefore, we can conclude that there is a problem in how we are analyzing the data, since the points are very different from each other.

3. * We expect air_time and arr_time_min - dep_time_min to be the same, but in many cases they are not.

i. What fraction of flights have different values for these two columns?

sum(flights$air_time == flights$diff, na.rm = TRUE)
## [1] 196
sum(flights$air_time == flights$diff, na.rm = TRUE) / sum(!is.na(flights$diff))
## [1] 0.0005974462

ii. Why not? Identify an egregious data issue which causes an error of several hours.

I would expect the numbers to be the same, but the numbers don’t match because we aren’t accounting for flights where the departure time is close to midnight and the arrival time is the next day. In minutes, our max is 1440, after that the minutes start from zero again.

iii. Identify an additional issue that causes smaller errors

The difference may be due to the different time zones between the city of origin and the city of destination.

iv. Bonus: identify a third issue that causes smaller errors

It could be that the air time is not taking into account other factors such as the movement of an aircraft on the ground, taxiing, when arriving and departing.

4. If air_time is too far off from arr_time - dep_time then we assume one of the columns contains an error. Implement your fix to the major problem. Remake the plot from two questions prior.

flights <- mutate(flights, diff2 = ifelse(flights$arr_time_min >flights$dep_time_min,  flights$arr_time_min - flights$dep_time_min, 1440 - flights$dep_time_min + flights$arr_time_min))
view(flights)
ggplot(data = flights) + geom_point(mapping = aes(x=air_time, y= diff2))

From the graph, we can see that the points are now much closer to each other. However, we might still have other problems like the ones listed above.

5. After implementing your fix, what fraction of observations still appear to have an error?

sum(flights$air_time == flights$diff2, na.rm = TRUE)
## [1] 201
sum(flights$air_time == flights$diff2, na.rm = TRUE) / sum(!is.na(flights$diff2))
## [1] 0.0006126872

6. Find the 10 least delayed flights using a ranking function. Carefully read the documentation for min_rank(). There may be more than 10 flights that are in the top 10! Why?

flights %>% 
  mutate(rank_dep_delay = min_rank(dep_delay)) %>% 
  arrange(rank_dep_delay) %>% 
  filter(rank_dep_delay <=10)
  • Selecting just the columns of interest:

From the result, we can see that there are more than 10 flights that are in the top 10 this is because the code min_rank assign the ties values the same rank. We can see that there are 4 flights that are tie in the 9 rank and two that are tie in the rank 7.

3.6 Transformations: Summarizing grouped data (15 points)

1. In lecture, we defined a not_cancelled data frame. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt= distance) (without using count()).

not_cancelled <- filter(flights,!is.na(dep_delay),!is.na(arr_delay))
  • Getting the same result as not_cancelled %>% count(dest) without count:
not_cancelled %>% group_by(dest) %>% summarise(n())
  • Getting the same result as not_cancelled %>% count(tailnum, wt= distance) without count:
not_cancelled %>% group_by(tailnum) %>% summarise(sum(distance))

2. Calculate average delays by destination for flights originating in NYC. Create a variable which ranks destinations from worst to best, where 1 is the worst rank.

  • Calculating by average departure delay (Assuming EWR is not in NYC):
flights %>% 
  filter(origin == "LGA" | origin == "JFK") %>% #Assuming EWR is not in NYC
  group_by(dest) %>%
  summarise(average_delay = mean(dep_delay, na.rm = TRUE)) %>%
  mutate(worstrank = min_rank(desc(average_delay)))%>%
  arrange(worstrank) #From worst ranking (1) to the best ranking

3. What hour of the day should you fly if you want to avoid arrival delays as much as possible?

flights %>% 
  group_by(hour) %>%
  summarise(mean_arrdelay = mean(arr_delay,  na.rm = TRUE)) %>%
  arrange(mean_arrdelay)

To avoid arrival delay as much as possible, I should leave at 7 am. Likewise, it can be seen that the hours between 5 am and 9 am have a negative arrival delay, meaning that those flights had an early arrival.

4. Which plane (tailnum) has the most minutes of delays total?

Taking into account the question reference the departure delay:

flights %>% 
  group_by(tailnum) %>%
  summarise(sum_depdelay = sum(dep_delay,  na.rm = TRUE)) %>%
  arrange(desc(sum_depdelay)) %>%
  head(1)

5. Find all destinations that are flown by at least three carriers. Order the results from most carriers to fewest.

  • Destinations and number of carriers that flown there
flights %>% 
  group_by(dest) %>%
  summarise(number_carriers = length(unique(carrier))) %>%
  arrange(desc(number_carriers)) %>%
  filter(number_carriers>=3)
  • Ordering number of carrier by total of destination that are flown by at least three carriers
flights %>% 
  group_by(dest) %>%
  mutate(number_carriers = length(unique(carrier))) %>%
  filter(number_carriers>=3) %>%
  group_by(carrier) %>%
  summarise(number_dest = length(unique(dest))) %>%
   arrange(desc(number_dest))

6. Calculate the number of non-cancelled flights by each carrier. Report the results in table in reverse alphabetical order with the min() and max() distance each airline covers.

flights %>%
  group_by(carrier) %>%
  filter(!is.na(dep_time), !is.na(arr_time))  %>%
  summarise(non_cancelled = n(), min(distance), max(distance)) %>%
  arrange(desc(carrier))

7. Which airlines only offer flights from New York state to one other airport?

  • Excluding EWR because is not part of New York state
flights %>% 
  filter(origin == "LGA" | origin == "JFK") %>% #Assuming EWR is not in NYC
  group_by(carrier) %>%
  summarise(number_dest = length(unique(dest))) %>%
  filter(number_dest == 1)

Therefore, F9 and HA airlines offer flights from New York state to just one other airport.

3.7 Transformations: Practical application (10 pts)

**1. *We posit that arrival delays are more important than departure delays from a passanger’s perspective. We are interested in determining what airlines to avoid based on their delay characteristics.**

a. Calculate the median arrival delay by airline.

flights %>%
  group_by(carrier) %>%
  summarise(median_arr_delay = median(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(median_arr_delay))

b. Median arrival delay may not be the best metric for our needs. Brainstorm at least 2 additional ways to assess the typical delay characteristics of a group of flights.

  • Using the mean flight arrival delay per airline
  • Using the percentage of flights that have an arrival delay greater than 30 min per airline

c. Write the dplyr code which calculates your delay measures separately by airline. Do particular airlines perform poorly on multiple metrics?

  • Using the mean flight arrival delay per airline
flights %>%
  group_by(carrier) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_arr_delay))

Base on their arrival delay characteristics, the five airlines that have the highest mean delay time are: F9, FL, EV, YV, OO. Therefore, it is recommended that those five be avoided.

  • Using the percentage of flights that have an arrival delay greater than 30 min per airline
flights %>%
  group_by(carrier) %>%
  summarise(percentage = sum(arr_delay >30, na.rm=TRUE)/ length(!is.na(arr_delay)) * 100) %>%
  arrange(desc(percentage))

Base on their arrival delay characteristics, the five airlines that have the highest percentage of delay by more than 30 min are: F9, EV, YV, FL, OO. Therefore, it is recommended that those five be avoided.

3.8 Transformations: Grouped operations with mutate() (and filter()) (15points)

1. * Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.

  • a. Order flights by departing airport, arrival airport, day, and scheduled departure time. For each flight, use lag() and group_by() to compute the delay on the previous flight – if there is such a flight on the same day.
dataframe1 <- flights %>%
  arrange(origin, dest, month, day, sched_dep_time) %>%
  group_by(origin) %>%
  filter(!is.na(dep_time), !is.na(arr_time)) %>%
  mutate(lag_depdelay = lag(dep_delay))
dataframe1

Months have been added in arrange, because otherwise the order is only done by day, causing errors with the lag variable

  • b. Make a plot which shows the relationship between a flight’s delay and the delay of the immediately preceding scheduled flight. You have a lot of data, so think carefully about how to develop a plot which is not too cluttered.
ggplot(data = dataframe1) + geom_point(mapping = aes(x=lag_depdelay, y=dep_delay)) + labs(x = "Previus Departure Delay Flight", y=  "Departure Delay Flight")

  • Not too cluttered pplot: measuring the average flight delays, after a given delay time for a previous flight
dataframe1 %>%
  group_by(lag_depdelay) %>%
  summarise(mean_depdelay = mean(dep_delay))  %>%
  ggplot() + geom_point(mapping = aes(x=lag_depdelay, y=mean_depdelay)) + labs(x = "Previus Departure Delay Flight", y=  "Mean Departure Delay Flight")

2. Now we will look at delays that occur in the air. We will need a sense of how long a flight is.

  • a. Compute the air time for each flight relative to the median flight to that destination. Which flights were most delayed in the air?
flights %>%
  group_by(dest) %>%
  mutate(median_air_time = median(air_time,na.rm = TRUE)) %>%
  mutate(most_delayed = air_time - median_air_time) %>%
  select(carrier, flight, dest, air_time, median_air_time, most_delayed) %>%
  arrange(desc(most_delayed))

In the table, we can see the from highest to lowest the flights that have been most late with respect to the median of the destination. Also, only the relevant columns have been selected to show.

  • Using filter to have the most delayed flights relative to the median:

3. For each plane, count the number of flights before the first delay of greater than 1 hour. (Hints: Construct a Boolean variable for every flight which measures whether it had a delay of greater than 1 hour and then use cumsum).

flights %>%
  arrange(month, tailnum, day, dep_delay) %>%
  group_by(tailnum) %>%
  mutate(boolean_va = cumsum(dep_delay)<60) %>%
  filter(boolean_va == 'TRUE') %>%
  summarise(numberflights_beforefirstdelay = n())